We will work through several exercises using tidycensus to fetch, wrangle, and map census data.

Setup

Be sure to clone or downloaded and unzip the workshop files from: https://github.com/dlab-berkeley/Census-Data-in-R

Then:

  1. Open the folder with the workshop files

  2. Double-click on the R Project file Census-Data-in-R.Rproj

  3. This should open RStudio - with the Files panel displaying the workshop folder contents.

  4. Double-click on the file Census-Data-in-R.Rmd to follow along!

You can also click on the file Census-Data-in-R.html in the Files tab to open the workshop tutorial in a web brower.

Install packages

If you installed any of these packages awhile ago, (especially tidycensus), it’s a good idea to install updates when you can (though not during the workshop as things can break!).

# Uncomment this to install packages, if necessary.
# install.packages(c("here", "tidyverse", "sf", "leaflet", "mapview", "tigris", "tidycensus"))

library(here)
## here() starts at /Users/pattyf/Documents/Dlab/workshops/AY2022/Census-Data-in-R-Sp22
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.1     ✓ dplyr   1.0.5
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   2.1.0     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(sf)
## Linking to GEOS 3.8.1, GDAL 3.2.1, PROJ 7.2.1
library(leaflet)
library(mapview)
## Warning: multiple methods tables found for 'crop'
## Warning: multiple methods tables found for 'extend'
library(tigris)
## To enable 
## caching of data, set `options(tigris_use_cache = TRUE)` in your R script or .Rprofile.
library(tidycensus)
## 
## Attaching package: 'tidycensus'
## The following object is masked from 'package:tigris':
## 
##     fips_codes

These seven libraries should be loaded in your environment now.

# If you run this chunk, output from the "here" function should be visible below. This is your local directory path. We can use this to import files later on. 
here()
## [1] "/Users/pattyf/Documents/Dlab/workshops/AY2022/Census-Data-in-R-Sp22"

Census API Key

You need a Census API key to programmatically fetch census data.

For more info on all available Census APIs see: https://www.census.gov/data/developers/data-sets.html

Census Data

We’ll begin by fetching US Census data with the tidycensus R package

The tidycensus package allows R users to quickly fetch data from a select subset of Census databases.

  • Many of the code samples in this tutorial borrow heavily from Kyle Walker’s tidycensus package documentation.

The key tidycensus functions we will use today are:

  • census_api_key: makes your Census API key available to tidycensus

  • load_variables: retrieves a dataframe of available census data variables

  • get_decennial: fetch census data for the most recent decennial censuses - 2000, 2010 (and soon 2020)

  • get_acs: fetch 2005 - 2020 data from the 1 and 5 year ACS (American Community Survey) data bases

Add Your Census API Key

Let’s get started!

  1. Copy and paste your Census API key from your email

  2. Use the tidycensus function census_api_key to register your API key with tidycensus. Don’t forget to put quotes around the key!.

# Install your census api key - long alphanumeric string
census_api_key("THE_BIG_LONG_ALPHANUMERIC_API_KEY_YOU_GOT_FROM_CENSUS")

Another way to add your Census API Key:

I keep my key in a file so no one can see it. One way to do this is by making a script that creates a variable key, and then using the source function to add that script as an object into your coding environment. The code chunk below is an example of how you might do that:

# source (run) an r script that creates a variable with my key
source("../../keys/census_api_key.R")

#print(my_census_api_key) 

# register the key
census_api_key(key = my_census_api_key)

Decennial Census Data

The get_decennial function:

We start by fetching from the 2010 Census with tidycensus’s get_decennial function. Let’s first talk about the code.

pop2010 <- get_decennial(geography = "state",   # census tabulation unit
                         variables = "P001001", # variable(s) of interest
                         year = 2010)           # census year
## Getting data from the 2010 decennial Census
## Using Census Summary File 1
head(pop2010)          
## # A tibble: 6 x 4
##   GEOID NAME       variable    value
##   <chr> <chr>      <chr>       <dbl>
## 1 01    Alabama    P001001   4779736
## 2 02    Alaska     P001001    710231
## 3 04    Arizona    P001001   6392017
## 4 05    Arkansas   P001001   2915918
## 5 06    California P001001  37253956
## 6 22    Louisiana  P001001   4533372

tidycensus data is tidy

By default, tidycensus returns data in a tidy, or long format that allows data for multiple variables to be contained within the variable and value columns. This is in contrast to untidy, or wide data where each variable is in its own column.

tidycensus can return long data if you can add the parameter output=wide to the function call.

# wide format
pop2010w <- get_decennial(geography = "state",   # census tabulation unit
                         variables = "P001001", # variable(s) of interest
                         year = 2010,           # census year
                         output="wide")         # get output in wide format
## Getting data from the 2010 decennial Census
## Using Census Summary File 1
head(pop2010w) 
## # A tibble: 6 x 3
##   GEOID NAME        P001001
##   <chr> <chr>         <dbl>
## 1 01    Alabama     4779736
## 2 02    Alaska       710231
## 3 04    Arizona     6392017
## 4 05    Arkansas    2915918
## 5 06    California 37253956
## 6 22    Louisiana   4533372

The GEOID column

The GEOID column is included in tidycensus output.

This is a Census geographic identifier for the tabulation unit.

The GEOID is sometimes called the Census FIPS code and for most tabulation units these are the same.

  • This makes it possible to link to Census geographic data and make maps. (We will do this in a bit.)

The GEOID is a text string and must be quoted.

  • Beware of GEOID leading zeros, since some software will remove these and convert GEIOD values to numbers (rather than text strings).

Question:

What is the GEOID for California?

Census Tabulation Units

Decennial census data is gathered from individuals and publicly distributed in aggregated form to protect privacy.

The Census tabulation units are the Census geographies to which the census data have been aggregated.

Some of the most common geographic tabulation units and their tidycensus function abbreviations are shown below, along with required and available filters that limit what data are returned.

Geography Definition Filter(s) Used in tidycensus
“us” United States get_acs(), get_decennial()
“region” Census region get_acs(), get_decennial()
“state” State or equivalent state get_acs(), get_decennial()
“county” County or equivalent state, county get_acs(), get_decennial()
“place” Census place state get_acs(), get_decennial()
“tract” Census tract state, county get_acs(), get_decennial()
“block group” Census block group state, county get_acs(), get_decennial()
“block” Census block state, county get_decennial() only!

get_decennial Geographic Tabulation Units and Filters

Let’s work together to fill in the code to fetch state population in 2010 just for California. You can find the code in the file Solutions.R.

  • See ?get_decennial for help

Challenge 1

Open Census-Data-in-R-Challenges.Rmd and use the get_decennial function like we’ve seen above, but fill in the code arguments to fetch State population in 2010 just for California. Solutions are available in the Solutions folder, as needed.

Change the tabulation unit

Let’s fetch 2010 population data for CA counties

What changes in the code?

get_decennial(geography = "county",              # census tabulation unit
                          variables = "P001001", # variable(s) of interest
                          year = 2010,           # census year
                          state='CA')            # Filter by state is CA
## Getting data from the 2010 decennial Census
## Using Census Summary File 1
## # A tibble: 58 x 4
##    GEOID NAME                            variable   value
##    <chr> <chr>                           <chr>      <dbl>
##  1 06011 Colusa County, California       P001001    21419
##  2 06007 Butte County, California        P001001   220000
##  3 06001 Alameda County, California      P001001  1510271
##  4 06003 Alpine County, California       P001001     1175
##  5 06005 Amador County, California       P001001    38091
##  6 06009 Calaveras County, California    P001001    45578
##  7 06013 Contra Costa County, California P001001  1049025
##  8 06015 Del Norte County, California    P001001    28610
##  9 06031 Kings County, California        P001001   152982
## 10 06021 Glenn County, California        P001001    28122
## # … with 48 more rows

Questions

  • How do we specify the state of CA above? How else can we?
  • Can you fetch population data for all counties in the USA or do you need to have a state= filter?

Adding a county filter

You can also filter tidycensus results by county

get_decennial(geography = "county",              # census tabulation unit
                          variables = "P001001", # variable(s) of interest
                          year = 2010,           # census year
                          state='CA',            # Filter by state is CA
                          county='Alameda')      # Filter by county Alameda
## Getting data from the 2010 decennial Census
## Using Census Summary File 1
## # A tibble: 1 x 4
##   GEOID NAME                       variable   value
##   <chr> <chr>                      <chr>      <dbl>
## 1 06001 Alameda County, California P001001  1510271

Challenge 2

In Census-Data-in-R-Challenges.Rmd, alter the code above to fetch 2010 population for Alameda & San Francisco Counties.

Visualize Results

We can visualize the data to get a quick overview of the distribution of data values.

It’s a first step in exploratory data analysis and a last step in data communication.

ggplot2 is the most commonly used R package for data visualization.

  • It is loaded when you load the tidyverse package.

Let’s use it to visualize the population data.

Plot 2010 Population by state

Use ggplot2 to create an ordered horizontal bar chart.

pop_plot <- ggplot(data=pop2010, 
                  aes(x=value, y=reorder(NAME,value)) ) + 
                  geom_point()

# display the plot
pop_plot

You can get real fancy with ggplot if you like.

# create a plot.
pop_plot <- ggplot(data=pop2010, 
                   # set aesthetic variables
                   aes(x=value/1000000, y=reorder(NAME,value)) ) + 
                   # pick geometry
                   geom_bar(stat="identity") +  
                   # add theme and titles. 
                   theme_minimal() + 
                   labs(title = "2010 US Population by State") +
                   xlab("Population (in Millions)") +
                   ylab("State")

# display the plot.
pop_plot

Identifying Census Variables

In the code above we fetched data for total population in 2010 using the variable "P001001".

That is not an obvious variable name, so how do we get those identifiers?

We can use the tidycensus load_variables function for this.

load_variables function

Use load_variables to fetch all variables used in the 2010 census into a dataframe.

vars2010 <- load_variables(year=2010,        # Year or end year for ACS-5yr
                           dataset = 'sf1',  # 'sf1' for decennial census
                           cache = TRUE)     # Save fetched data locally

# How large is the output
dim(vars2010)
## [1] 8959    3
# Take a look with head or View
head(vars2010)
## # A tibble: 6 x 3
##   name    label                                concept        
##   <chr>   <chr>                                <chr>          
## 1 H001001 Total                                HOUSING UNITS  
## 2 H002001 Total                                URBAN AND RURAL
## 3 H002002 Total!!Urban                         URBAN AND RURAL
## 4 H002003 Total!!Urban!!Inside urbanized areas URBAN AND RURAL
## 5 H002004 Total!!Urban!!Inside urban clusters  URBAN AND RURAL
## 6 H002005 Total!!Rural                         URBAN AND RURAL

2010 Decennial Census Tables & Variables

  • Over 3,000 unique variables that describe population and housing characteristics

  • Organized in 333 Tables

    • 177 population tables (identified with a ‘‘P’’) available to the block level
    • 58 housing tables (identified with an ‘‘H’’) available to the block level
    • 82 population tables (identified with a ‘‘PCT’’) available to the census tract level
    • 4 housing tables (identified with an “HCT”) available to the census tract level
    • 10 population tables (identified with a “PCO”) available to the county level
    • plus 2 additional PCT tables

https://www.census.gov/data/datasets/2010/dec/summary-file-1.html

What Variable Has the 2010 Total Population value?

We can sort and filter the vars2010 dataframe to find it.

Questions

What 2010 decennial census variable contains…

  • Median Age

  • Average Family Size

  • Number of occupied housing units

Answers are in Solutions.Rmd

Challenge 3

Return to Census-Data-in-R-Challenges.Rmd and use the get_decennial function to fetch and plot an Avg Family Sizevaraible by CA County in 2010, and name the call as a dataframe, ca_fam_size. Once you’ve done that, plot the dataframe with the ggplotcall below. Hint: “P037001”

Challenge 4

Repeat the previous challenge with data from the 2000 decennial census. Don’t assume variable names are the same across the 2000 and 2010 census

Use load_variables to check!

Census Tract Data

Census tracts are the most commonly used census tabulation unit.

Let’s fetch population data for the census tabulation unit to tract

Because of the large number of census tracts, you MUST specify a state when requesting these data with tidycensus.

## Fetch population by **tract** for California.
ca_tract_pop2010 <- get_decennial(geography = "tract",   # census tab unit
                                   variables = "P001001", # var of interest
                                   year = 2010,           # census year
                                   state='CA')      # State filter
## Getting data from the 2010 decennial Census
## Using Census Summary File 1
# How many tracts in CA
dim(ca_tract_pop2010)
## [1] 8057    4
# take a look
head(ca_tract_pop2010)
## # A tibble: 6 x 4
##   GEOID       NAME                                                variable value
##   <chr>       <chr>                                               <chr>    <dbl>
## 1 06037434004 Census Tract 4340.04, Los Angeles County, Californ… P001001   2796
## 2 06037460000 Census Tract 4600, Los Angeles County, California   P001001   4851
## 3 06037460200 Census Tract 4602, Los Angeles County, California   P001001   5315
## 4 06037460301 Census Tract 4603.01, Los Angeles County, Californ… P001001   4638
## 5 06037460302 Census Tract 4603.02, Los Angeles County, Californ… P001001   4442
## 6 06037460401 Census Tract 4604.01, Los Angeles County, Californ… P001001    878

Fetching Census Tract Data

Census tract data can be quite large!

Fortunately, you can also limit the results to one or more counties.

tract_pop2010 <- get_decennial(geography = "tract",   # census tabulation unit
                         variables = "P001001",       # variable of interest
                         year = 2010,                 # census year - only one!
                         state="CA",                  # limit to California
                         county=c("Alameda","Contra Costa"))  # & counties
## Getting data from the 2010 decennial Census
## Using Census Summary File 1
dim(tract_pop2010)
## [1] 569   4

Fetching data for more than one census variable

What three things are new here?

#urban and rural pop for 3 CA counties
ur_pop10 <- get_decennial(geography = "county",  # census tabulation unit
                           variables = c(urban="P002002",rural="P002005"),
                           year = 2010, 
                           summary_var = "P002001",  # The denominator
                           state='CA',
                           county=c("Napa","Sonoma","Mendocino"))
## Getting data from the 2010 decennial Census
## Using Census Summary File 1

Fetching data for more than one census variable

  1. You can specify more than one variable:
variables = c("P002002","P002005")
  1. You can rename the values in the output ‘variable’ column.
variables = c(urban="P002002",rural="P002005")
  1. You can identify a summary_var (a denominator - here, the total count of all people or households surveyed. Can be used for calculations like percent of total.)
summary_var = "P002001"
# take a lookat the results
ur_pop10
## # A tibble: 6 x 5
##   GEOID NAME                         variable  value summary_value
##   <chr> <chr>                        <chr>     <dbl>         <dbl>
## 1 06045 Mendocino County, California urban     48110         87841
## 2 06055 Napa County, California      urban    118194        136484
## 3 06097 Sonoma County, California    urban    424102        483878
## 4 06045 Mendocino County, California rural     39731         87841
## 5 06055 Napa County, California      rural     18290        136484
## 6 06097 Sonoma County, California    rural     59776        483878

Calculating Percents

The summary_value column comes in handy when you want to compute percent of total, for example:

# Calculate the percent of population that is Urban or Rural
ur_pop10 <- ur_pop10 %>%
            mutate(pct = 100 * (value / summary_value))

# Take a look at the output.
ur_pop10 
## # A tibble: 6 x 6
##   GEOID NAME                         variable  value summary_value   pct
##   <chr> <chr>                        <chr>     <dbl>         <dbl> <dbl>
## 1 06045 Mendocino County, California urban     48110         87841  54.8
## 2 06055 Napa County, California      urban    118194        136484  86.6
## 3 06097 Sonoma County, California    urban    424102        483878  87.6
## 4 06045 Mendocino County, California rural     39731         87841  45.2
## 5 06055 Napa County, California      rural     18290        136484  13.4
## 6 06097 Sonoma County, California    rural     59776        483878  12.4

Plots give us compact visual summaries of the data.

## Plot it
myplot <- ggplot(data = ur_pop10, 
          mapping = aes(x = NAME, fill = variable, 
                     y = ifelse(test = variable == "urban", 
                                yes = -pct, no = pct))) +
          geom_bar(stat = "identity") +
          scale_y_continuous(labels = abs, limits=c(-100,100)) +
          labs(title="Urban & Rural Population in Wine Country", 
               x="County", y = " Percent of Population", fill="") +
          coord_flip()

myplot

Don’t worry if you don’t get all the ggplot code now. It’s here for reference.

  • You may want to check out our R Data Visualization with ggplot workshop!

Data Wrangling

Your R skills can help you reformat the data and make it more usable.

Let’s fetch population data for 2010 and 2000 by state.

Then we will combine these into one data frame using the tidyverses::bind_rows function

# Fetch 2000 population data by state
pop2000 <- get_decennial(geography = "state",
                         variables = c(pop2000="P001001"), 
                         year = 2000)
## Getting data from the 2000 decennial Census
## Using Census Summary File 1
# Fetch 2010 population data by state
pop2010 <- get_decennial(geography = "state",
                         variables = c(pop2010="P001001"), 
                         year = 2010)
## Getting data from the 2010 decennial Census
## Using Census Summary File 1
# Use tidyverse `bind_rows` function to combine the data for these years
state_pop <- bind_rows(pop2000, pop2010)

# Take a look with head or View
head(state_pop)
## # A tibble: 6 x 4
##   GEOID NAME       variable    value
##   <chr> <chr>      <chr>       <dbl>
## 1 01    Alabama    pop2000   4447100
## 2 02    Alaska     pop2000    626932
## 3 04    Arizona    pop2000   5130632
## 4 05    Arkansas   pop2000   2673400
## 5 06    California pop2000  33871648
## 6 08    Colorado   pop2000   4301261

Any Questions?

Mapping

Mapping Census Data with tidycensus

You can fetch census geographic data by adding the parameter geometry=TRUE to tidycensus functions

  • Under the hood, tidycensus calls the tigris package to fetch data from the Census Geographic Data APIs.

  • Only a subset of data available via tigris can be accessed via tidycensus.

You can then use your favorite R mapping functions or libraries like plot, ggplot, and tmap to make maps.

Geometry Options

Before fetching census geographic data, we need to set the option tigris_use_cache to TRUE

Caching saves data locally. This greatly speeds things up if you fetch the same census geographic data repeatedly.

# Tigris options - used by tidycensus
# Cache retrieved geographic data locally
options(tigris_use_cache = TRUE)  

Fetch Geographic Boundary Data with tidycensus

We fetch the geospatial data by setting geometry=TRUE.

pop2010geo <- get_decennial(geography = "state", 
                          variables = c(pop10="P001001"), 
                          year = 2010, 
                          output="wide", 
                          geometry=TRUE) # Fetch geometry data for mapping
## Getting data from the 2010 decennial Census
## Using Census Summary File 1

Take a look

Let’s take a minute to discuss the format of an sf spatial object.

head(pop2010geo, 3)
## Simple feature collection with 3 features and 3 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -90.41814 ymin: 41.23796 xmax: -66.9499 ymax: 48.19097
## Geodetic CRS:  NAD83
## # A tibble: 3 x 4
##   GEOID NAME        pop10                                               geometry
##   <chr> <chr>       <dbl>                                     <MULTIPOLYGON [°]>
## 1 23    Maine      1.33e6 (((-67.61976 44.51975, -67.61541 44.52197, -67.58774 …
## 2 25    Massachus… 6.55e6 (((-70.83204 41.6065, -70.82373 41.59857, -70.82092 4…
## 3 26    Michigan   9.88e6 (((-88.68443 48.11578, -88.67563 48.12044, -88.67639 …

Geospatial Data in R

The tidycensus package uses the R sf package to manage geospatial data.

R sf objects include:

  • a dataframe with a geometry column with the name geometry

    • The geometry can be of type POINT, LINE, POLYGON
    • or, MULTIPOINT, MULTILINE or MULTIPOLGYON
  • a CRS (coordinate reference system), specified by

    • epsg(SRID) code
    • proj4string

For a deeper understanding of the sf package and its functionality, we recommend

Census Data Coordinate Reference System (CRS)

All geospatial data are referenced to the surface of the earth with a CRS, or coordinate reference system. Anyone working with geospatial data will need to develop an understanding of CRSs.

Fortunately, many of us are familiar with longitude and latitude, which are geographic coordinates. But there are different versions of geographic CRSs. And there are also projected CRSs which transform longitude and latitude to 2 dimensional surface for mapping & analysis.

All census geographic data use the NAD83 geographic CRS. NAD83 stands for North American Datum of 1983. This CRS is best for locations in North America.

Many geospatial operations require you transform data to a common CRS before conducting spatial analysis or mapping.

  • This could be an issue if you try to combine the census geospatial data with other geospatial data. But it is not an issue in this tutorial.

An in-depth discussion of CRSs is outside the scope of this workshop. See Geocomputation in R for more information.

Mapping sf Spatial Objects

We can use sf::plot to make a quick map the geometry stored in an sf spatial object.

# plot the geometry column data
plot(pop2010geo$geometry)

The Challenge of US maps

The vast geographic extent and non-contiguous nature of the USA makes it difficult to map.

Fortunately, tidycensus includes a shift_geo parameter to shift AK & HI to below Texas.

pop2010geo_shifted <- get_decennial(geography = "state", 
                                    variables = c(pop10="P001001"), 
                                    output="wide",
                                    year = 2010, 
                                    geometry=TRUE, 
                                    shift_geo=TRUE)
## Warning: The `shift_geo` argument is deprecated and will be removed in a future
## release. We recommend using `tigris::shift_geometry()` instead.
## Getting data from the 2010 decennial Census
## Using feature geometry obtained from the albersusa package
## Using Census Summary File 1
## Please note: Alaska and Hawaii are being shifted and are not to scale.
## old-style crs object detected; please recreate object with a recent sf::st_crs()
## Shift Happens!
plot(pop2010geo_shifted$geometry)

Save it

You can save any sf data object to a shapefile using st_write

st_write(pop2010geo_shifted, here("data_out/usa_pop2010_shifted.shp"))
# Check to see if the data was written out to a shapefile
dir(here("data_out")) 

Mapping Data Values

You can use the sf plot command to make a map that sets the color of the geometry by the data values

# Name the column with the variable values to make
# a thematic map, also called a choropleth map.
plot(pop2010geo_shifted['pop10'])  

ggplot2 Map

ggplot knows what to do with sf objects!

ggplot(pop2010geo_shifted, aes(fill = pop10)) + 
  geom_sf()  # tells ggplot that geographic data are being plotted

Let’s make that map a little nicer to look at.

ggplot(pop2010geo_shifted, aes(fill = pop10)) + 
  geom_sf(color=NA) + # What does color=NA do
  coord_sf(crs = 3857) + # Dynamically change the CRS
  scale_fill_viridis_c(option = "viridis")  # Change the color palette

                                            # Try different options, e.g.
                                            # plasma, magma, inferno, cividis

Challenge 5

In your Census-Data-in-R-Challenges.Rmd file, mreate a map of CA Median Age by county in 2010. Solutions are in the Solutions.Rmd file

Fetch Census Data and Geometry for Multiple States or Counties

We can fetch Census data and the geometry for more than one state or county with same function call.

  • This is so much easier than any alternative approach!

  • It can be applied to any available geographic tabulation areas (eg states, counties, tracts, places).

Let’s try it with Census Tracts!

Fetch tract population and geometry data for Bay Area Counties.

bay_counties <- c("Alameda", "Contra Costa", "Marin", "San Francisco",
                  "Sonoma", "Napa","Solano", "San Mateo", "Santa Clara")

bayarea_pop10 <- get_decennial(geography = "tract", 
                      variables = "P001001", 
                      year = 2010, 
                      state='CA',
                      county=bay_counties,
                      geometry=T)
## Getting data from the 2010 decennial Census
## Using Census Summary File 1
# Quick map
plot(bayarea_pop10['value'])

Questions?

ACS 5-year data

  • ACS Data with get_acs

  • ACS data contains the most recent information about the American population.

  • We can use the tidycensus function get_acs to retrieve ACS data using code very similar to get_decennial.

BUT the workflow is more complex because:

  1. The ACS has a lot more tables and variables, and

  2. The ACS contains sample data, so each ACS variable that you retrieve with tidycensus will fetch both an estimate of the value and a margin of error.

ACS Data Products

The ACS has two primary data products - the ACS 1 year database and the 5 year database.

  • The ACS 3 year data product has been discontinued.

  • The ACS 1 year data is more current but has a larger margin of error and is not available for Census geographies with a population of < 65,000.

  • So the ACS 5-year data is the most commonly used data set.

Fetch List of ACS 5-Year Variables

Let’s use the load_variables function to get a dataframe of all variables from the ACS 2015—2019 5-year dataset.

  • Then, find the variable name for median household income.
vars_acs2019 <- load_variables(year=2019,      # end year 2016-2020 period
                              dataset = 'acs5', # the ACS data product
                              cache = T) # Save locally for future access

# how many variables?
dim(vars_acs2019)
## [1] 27040     3
# Take a look at the resultant dataframe
## What is the variable for median household income?
#View(vars_acs2019)

Fetch Data on Median Household Income

Let’s fetch the median household income data for Alameda County by Census Tract.

alco_mhhincome <- get_acs(geography='tract',
                        variables=c(median_hhincome = "B19013_001"),
                        year = 2019,
                        state='CA',
                        county='Alameda',
                        geometry=TRUE
                        )
## Getting data from the 2015-2019 5-year ACS

Take a look

head(alco_mhhincome)
## Simple feature collection with 6 features and 5 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -122.2887 ymin: 37.52248 xmax: -121.8779 ymax: 37.81562
## Geodetic CRS:  NAD83
##         GEOID                                             NAME        variable
## 1 06001442301 Census Tract 4423.01, Alameda County, California median_hhincome
## 2 06001437400    Census Tract 4374, Alameda County, California median_hhincome
## 3 06001437701 Census Tract 4377.01, Alameda County, California median_hhincome
## 4 06001402400    Census Tract 4024, Alameda County, California median_hhincome
## 5 06001402500    Census Tract 4025, Alameda County, California median_hhincome
## 6 06001450743 Census Tract 4507.43, Alameda County, California median_hhincome
##   estimate   moe                       geometry
## 1   110761 21966 MULTIPOLYGON (((-121.9701 3...
## 2    86210  9325 MULTIPOLYGON (((-122.0926 3...
## 3    64559  6732 MULTIPOLYGON (((-122.0747 3...
## 4    39913  8581 MULTIPOLYGON (((-122.284 37...
## 5    30000 12436 MULTIPOLYGON (((-122.2879 3...
## 6   128737  9289 MULTIPOLYGON (((-121.9066 3...

Map Median Household Income by tract

What is the variable?

plot(alco_mhhincome['estimate'])

Fetching Multiple ACS-5 Variables

First identify the variables of interest.

# Median household income by race/ethnicity: Variables from ACS 2015—19
#All households =   "B19013_001",
inc_by_race <- c(White = "B19013H_001",
                 Black = "B19013B_001",
                 Asian = "B19013D_001",
                 Hispanic = "B19013I_001" )

Fetch census tract data for multiple variables at once.

# Fetch the Data
alco_mhhinc_by_race <- get_acs(geography='tract',
                                  variables=inc_by_race,
                                  year = 2019,
                                  state='CA',
                                  county='Alameda',
                                  geometry=T )
## Getting data from the 2015-2019 5-year ACS

Facet Map

Facet maps are a way to create visualizations of small multiples, or subsets of the data in order to facilitate comparisons. Here, we use ggplot’s facet_wrap function to make multiple maps of median household income by race for Alameda County.

# Create the map
medhhinc_facet_map <- alco_mhhinc_by_race %>%
                        ggplot(aes(fill = estimate)) +
                          facet_wrap(~variable) +
                          geom_sf(color=NA) +
                          scale_fill_viridis_c(option="magma")

# Display the map
medhhinc_facet_map

#

Challenge

Make a ggplot map of MEDIAN GROSS RENT (DOLLARS) in San Francisco County by tract using data from the ACS 2015—2019 5-year product.

# Fill in the code to fetch the data - (Solutions.R has the code)

# Median household rent for San Francisco County
alcc_medrent <- get_acs(geography= ,
                              variables= ,
                              year = ,
                              state= ,
                              county= ,
                              geometry=)
=======
>>>>>>> 28914a878c5f685ff5f665f2dd8401dee89f99f1:Lessons/Census-Data-in-R.Rmd

Challenge 6

In Census-Data-in-R-Challenges.Rmd file, Make a ggplot map of MEDIAN GROSS RENT in San Francisco County by tract using data from the ACS 2015—2019 5-year product. Check Solutions.Rmd for answers, as needed.

Interactive Mapping

Interactive mapping gives the RStudio environment some of the functionality of desktop GIS.

There are a number of R packages that you can use, including:

  • mapview: quick interactive exploratory data viewing

  • tmap: great static and interactive maps

  • Leaflet: highly customizable interactive maps

All of these are based on the Leaflet Javascript Library.

Interactive Choropleth Maps

Let’s use mapview to make some quick interactive maps of our median hhousehold income data

mapview(alco_mhhincome)
mapview(alco_mhhincome, zcol="estimate")

Leaflet is the ggplot2 of interactive mapping. Leaflet in R follows a tidyverse convention, using pipes (%>%) to create layers in the mapping object. We can use leaflet to create interactive maps allowing for more flexibility in design and features we can create. With added complexity in the code, of course!

# Create a color palette
pal <- colorNumeric(
  palette = "YlOrRd",
  domain = alco_mhhincome$estimate
)

# specify dataset
leaflet(alco_mhhincome) %>% 
   addProviderTiles(providers$CartoDB.Positron) %>% 
  # adjust color palette and ploygon features.
  addPolygons(stroke = FALSE, smoothFactor = 0.2, fillOpacity = .5,
     color = ~pal(estimate)) %>% 
  # add legend 
  addLegend(pal = pal, values = ~estimate,
            title = "Median Income",
            labFormat = labelFormat(prefix = "$"),
            position = "bottomleft")
## Warning: sf layer has inconsistent datum (+proj=longlat +datum=NAD83 +no_defs).
## Need '+proj=longlat +datum=WGS84'

Challenge 7

In the Census-Data-in-R-Challenges.Rmd, use mapview to create an interactive choropleth map of median household rent.

Figuring out the ACS Variables to use

ACS variables can be confusing.

Some ways to identify the best variables to explore:

  • Web search, especially Census web resources, can help.

  • The Census Reporter website (https://censusreporter.org) provides another tool for navigating topics, tables, and variable names.

  • The NHGIS website (nhgis.org) is a great way to browse variables of interest.

Margins of Error (MOE)

We haven’t talked about it but it may be important in your work with ACS data.

Math is needed to combine MOEs when you combine variables.

  • tidycensus includes some nice functions for these calculations and a good overview of the topic.

Summary

tidycensus offers two key functions for fetching census tabular and geographic: get_acs and get_decennial.

  • The load_variables function helps identify the names of census variables of interest.

Support for fetching population estimates and migration flow census data was recently added to tidycensus. You can read up on it on the tidycensus documentation website

Using tidycensus to fetch the tabular data or both tabular and geographic data is IMO way easier than any alternatives, IF you (1) know R, (2) know a bit about working with geographic data in R.

This approach is also scaleable if you want multiple census variables for various locations and tabulation areas.

You can greatly enhance your maps if you make them with ggplot2 rather than the default plot command.

Interactive mapping greatly enhances your ability to do exploratory data analysis in RStudio.

References

Much of this tutorial is based on resources by Kyle Walker, author of tidycensus. See:

Related D-Lab Workshops

Great online resource for working with spatial data in R


Appendix

A More Complex Query

Let’s use the 2010 census data to map the percent of San Francisco (SF) properties that were rented.

To start, identify the variables for the

  • Total number of housing units

  • Number of renter occupied units

Complete the query

sf_rented <- get_decennial(geography =  ,  # census tabulation unit
                           variables =   , # number of households rented
                           year =  , 
                           summary_var = ,  # Total households
                           state=,
                           county=,
                           geometry=)

And here it is SF Percent Rented Units, 2010

sf_rented <- get_decennial(geography = "tract",  # census tabulation unit
                           variables =  "H004004", #number of households rented
                           year = 2010, 
                           summary_var = "H004001",  # Total households
                           state='CA',
                           county='San Francisco',
                           geometry=T)

# take a look at the output
head(sf_rented)

Calculate Percent Rented

sf_pct_rented <- sf_rented[sf_rented$value > 0,] %>%
                 mutate(pct = 100 * (value / summary_value))

# Take a look
head(sf_pct_rented)

Map the result

plot(sf_pct_rented['pct'])